Adrian Tran, Harshitha Bachina, Hayden King, Josh Shneyder
Summary of Data
Data Description
In this final project report, we decided to explore the relationship between the inflation-adjusted GDP per capita in US dollars and the Life Expectancy at birth. We retrieved this data from Gapminder, which is a data source that combines data from multiple sources into a coherent time series. The life expectancy data set shows the average number of years a newborn child would live if current mortality rates stay the same. The data set includes data on almost every country and includes rates from 1799 to 2099. The inflation-adjusted GDP per capita in US dollars covers almost every country from 1959 to 2019, but there is some missing data in the earlier years for less developed countries. We joined these data sets together to analyze the relationship between GDP per capita and Life Expectancy between 1959 and 2019.
This project often utilizes the \(R^2\) value as a comparative measure for goodness of fit for a given model. This statistical measure determines the variation of a dependent variable that can be explained by an independent variable. A higher \(R^2\) signifies a stronger relationship between the two variables. The value of the statistic ranges between 0 and 1, with 1 signifying a perfect relationship and 0 signifying no correlation at all.
Linear Visualization
To visualize the relationship between these variables, we started with a scatterplot.
Code
ggplot(data = joined_data, aes(x = GDP, y = Life_Expectancy)) +geom_point(alpha =0.1, color ="orange") +labs(x ="GDP per capita",y ="",subtitle ="Life Expectancy",title ="Relationship between GDP per capita and Life Expectancy" ) +geom_smooth(method ="lm")
The relationship between \(GDP\) and \(LifeExpectancy\) is approximately logarithmic, so we did a log transformation of \(GDP\). We called the new variable \(lGDP\). \[lGDP = ln(GDP)\]
Logarithmic Visualization
Code
lineargraph =ggplot(data = joined_data, aes(x = lGDP, y = Life_Expectancy)) +geom_point(alpha =0.1, color ="orange") +labs(x ="Ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Relationship between Ln(GDP per capita) and Life Expectancy" ) +geom_smooth(method ="lm")lineargraph
Code
logmodel =lm(Life_Expectancy ~ lGDP, data = joined_data)
We will fit a regression line to this transformed data. From here on out, we will examine only the relationship between \(lGDP\) and \(Life Expectancy\).
Linear Regression of Transformed Data
Assumptions of the Linear Regression Model
To fit a linear regression we need to check 5 assumptions. Some of these assumptions are violated to differing degrees, but none of them are severely violated. The high sample size makes it so that very small violations of the assumptions can be detected.
1. Linearity
As shown above, the data still shows some curvature after the log transformation.
2. Random Sampling/ No Autocorrelation
The assumption of random sampling isn’t met, but we can substitute the assumption of no autocorrelation. The data shows partial autocorrelation of about 0.2 which is relatively stable across various lags. The Durbin Watson test gives a p-value of 0 for the autocorrelation, meaning there is certainly autocorrelation in the data, so this assumption isn’t met either. The blue dotted lines represent a 95% significance level.
Code
joined_data <- joined_data |>mutate(residuals = logmodel$residuals)acf(joined_data$residuals, type ="correlation")
3. No colinearity
Since there is only one explanatory variable (\(lGDP\)), there can’t be any colinearity.
4. Homoskedasticity
Homoskedasticity means that the residuals at each point on the regression line have equal variance. In other words, the fitted values cannot predict the residuals. We conducted the Breusch-Pagan test to determine if Heteroskedasticity (unequal variances of residuals across the regression line) is present in the data.
Code
bptest(logmodel)
studentized Breusch-Pagan test
data: logmodel
BP = 301.4, df = 1, p-value < 2.2e-16
\[BP = n \times R^2_{\hat u^2} = 301.4 \]\[9162 \times R^2_{\hat u^2} = 301.4\]\[R^2_{\hat u^2} = 0.033\] While the data isn’t homoskedastic, the Breusch-Pagan test statistic shows that only 3.3% of the variance in \(u_i\) can be explained by the variance in \(lGDP\). We can say this because \(var(u_i) = E(u^2_i) - E(u_i)^2\) and \(E(u_i)=0\). While this heteroskedasticity is very much present with a p-value of 0, that is more a function of the huge sample size than a significant amount of the variation in \(u^2\) being explained by \(lGDP\).
While the distribution is skewed left, it isn’t severe. This is because of the low outliers mentioned earlier. Any negative effects of this assumption not being met should be made up for by the high sample size.
Model Fit
Linear regression models can be assessed by how much of the variance in the response they account for. We can calculate this with the \(R^2\) statistic mentioned earlier.
Code
variance_table =matrix(nrow =1, ncol =3)colnames(variance_table) =c("Variance of Response","Variance of Fitted Values","Variance of Residuals")variance_table[1, 1] =var(joined_data$Life_Expectancy)variance_table[1, 2] =var(logmodel$fitted.values)variance_table[1, 3] =var(logmodel$residuals)variance_table[1,] =round(variance_table[1,], digits =2)kable(variance_table) |>kable_material_dark()
Variance of Response
Variance of Fitted Values
Variance of Residuals
97.63
59.34
38.29
\[R^2 = 1-\frac{38.29}{97.63} = \frac{59.34}{97.63} = 0.608\] The variance in \(lGDP\) accounts for 60.8% of the variance in \(Life Expectancy\). This is a moderate \(R^2\), so this model does a decent job of explaining the response. Considering how many factors may influence \(Life Expectancy\) across countries and time, this is a good quality model.
Estimated Model
The following model was obtained by minimzing the squared residuals of a linear model fitted to the data. This process is called OLS (Ordinary Least Squares.) \[\widehat{Life Expectancy} = 22.84 + 5.312lGDP\] The logarithmic transformation of GDP per capita makes the linear model fit quite well. This model predicts that if GDP per capita increases by 1%, mean \(LifeExpectancy\) increases by approximately 0.05312 years. At a GDP per capita of 1 dollar, mean \(LifeExpectancy\) is 22.84 years.
Changes Over Time
The regression model we estimated combines data for the years 1959 through 2019. Since so much has changed over the course of those 60 years, we estimated a separate OLS model for each year to look at the changes over time.
Code
years =1959:2019lms_year =matrix(nrow =length(years), ncol =2)lms_year[, 1] = yearscoef_year =matrix(nrow =length(years), ncol =3)coef_year[, 1] = yearsfor (y in years) { year_data = joined_data |>filter(Year == y) year_model =lm(Life_Expectancy ~ lGDP, data = year_data) lms_year[y -1958, 2] =summary(year_model)$r.squared coef_year[y -1958, 2] = year_model$coefficients[1] coef_year[y -1958, 3] = year_model$coefficients[2]}lms_year =data.frame(lms_year)colnames(lms_year) =c("Year", "Rsquared")coef_year =data.frame(coef_year)colnames(coef_year) =c("Year", "Beta0", "Beta1")ggplot(data = lms_year, aes(x = Year, y = Rsquared)) +geom_line(color ="steelblue") +ylim(0, 1) +labs(y ="",subtitle ="R-Squared",title ="R-Squared of ln(GDP per capita) and Life Expectancy over time")
While variation in \(lGDP\) has consistently explained about 60% of variation in \(LifeExpectancy\), the nature of this relationship has changed significantly over time. There have also been dips in the \(R^2\) corresponding to outliers in the data.
Code
library(patchwork)beta0 =ggplot(data = coef_year, aes(x = Year, y=Beta0)) +geom_line(color ="steelblue") +labs(y ="",subtitle ="Estimated Intercept",title ="Estimated Intercept Over Time")beta1 =ggplot(data = coef_year, aes(x = Year, y=Beta1)) +geom_line(color ="steelblue") +labs(y ="",subtitle ="Estimated Slope",title ="Estimated Slope of ln(GDP per capita) Over Time")beta0+beta1
The left graph shows how the mean \(LifeExpectancy\) when GDP per capita is 1 dollar has increased dramatically over time. The right graph shows that while a 1% increase in GDP per capita increased \(LifeExpectancy\) by almost 0.07 years in 1959, it now does so by only around 0.04 years.
Differences Between Countries
There are no countries in the data that have suddenly increased their life expectancy without increasing their GDP, resulting in a high positive residual. There are, however, several low outliers in the data that result from countries going through traumatic events that decrease life expectancy. These low outliers can be seen on the interactive scatterplot below.
countries = joined_data |>select(country) |>distinct()lms_country =matrix(nrow =length(countries), ncol =2)lms_country[, 1] = countrieslms_country =data.frame(lms_country)colnames(lms_country) =c("Country", "Rsquared")idnum =1for (c in countries$country) { country_data = joined_data |>filter(country == c) country_model =lm(Life_Expectancy ~ lGDP, data = country_data) lms_country[idnum , "Rsquared"] =summary(country_model)$r.squared idnum = idnum +1}ggplot(data = lms_country, aes(x = Rsquared)) +geom_histogram(binwidth =0.05, fill ="steelblue") +labs(x ="R-squared",y ="",subtitle ="Count",title ="Histogram of Countries by R-Squared of ln(GDP per capita) and Life Expectancy" )
While the relationship between \(lGDP\) and \(Life Expectancy\) is quite strong for most countries, for some countries it is much weaker. This is mostly due to events that decrease life expectancy such as epidemics, famines, natural disasters and wars. Botswana for example had a linear relationship between \(lGDP\) and \(Life Expectancy\) until the AIDS crisis hit. This drop in \(Life Expectancy\) while \(lGDP\) continued to grow is responsible for its \(R^2\) of 0.0013.
Countries that have suffered from natural disasters and prolonged civil wars make up most of the countries with weak relationships.
Countries with a Strong Relationship
Code
joined_data |>filter(country =="Germany") |>ggplot(aes(x = lGDP, y = Life_Expectancy)) +geom_point() +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Life Expectancy and ln(GDP per capita) in Germany" ) +geom_smooth(method ="lm")
In countries without life expectancy-decreasing events since 1959, there can be a very strong linear relationship. In the case Germany, the \(R^2\) is 0.983.
We are going to do two simulations to validate our model.
Simulated Distribution
First, we generated a simulated distribution by taking each of the fitted values and adding noise based on the standard error of the residuals.
Code
set.seed(1738)sigma =sigma(logmodel)predictions =predict(logmodel)simulation <-function(x, rse) { x <- x +rnorm(n =length(x), mean =0, sd = rse)return(x)}actualhist =ggplot(data = joined_data, aes(x = Life_Expectancy)) +geom_histogram(binwidth =2, fill ="steelblue") +labs(x ="Life Expectancy",y ="",subtitle ="Count",title ="Observed Distribution" )simhist =ggplot(data = joined_data, aes(x =simulation(predictions, sigma))) +geom_histogram(binwidth =2, fill ="steelblue") +labs(x ="Simulated Life Expectancy",y ="",subtitle ="Count",title ="Simulated Distribution" )actualhist + simhist
Because of the left skew of the data, the simulated distribution doesn’t exactly match the actual distribution. Due to the high sample size (9162 observations,) this isn’t a huge issue as far as fitting the model.
Multiple Predictive Checks
Second, we generated 2000 simulated data sets using the method above and found the \(R^2\) of each one. This shouldn’t be compared to the \(R^2\) of 0.608 found earlier.
Code
simulate <-function() { x = predictions +rnorm(n =length(joined_data$lGDP),mean =0,sd =6.188) simmodel =lm(joined_data$Life_Expectancy ~ x)return(summary(simmodel)$r.squared)}ggplot(data =NULL, aes(x =map_dbl(1:2000, ~simulate()))) +geom_histogram(fill ="steelblue") +labs(x ="Simulated R-squared of Life Expectancy and Simulated ln(GDP per capita)",y ="",subtitle ="Count",title ="R-squared Simulation" )
Conclusion
After exploring the data and fitting a model, we can conclude that there is a moderate positive relationship between ln(GDP per capita) and life expectancy. This relationship has held over time, although it is sometimes thrown off by traumatic events.
Source Code
---title: "Exploring the Relationship between GDP per capita and Life Expectancy"author: "Adrian Tran, Harshitha Bachina, Hayden King, Josh Shneyder"format: html: self-contained: true code-tools: true toc: true code-fold: trueeditor: sourceexecute: error: true echo: true message: false warning: false---```{r}#| include: falsegdp =read.csv("~/Desktop/stat-331-71-PA6/gdppercapita_us_inflation_adjusted.csv")life =read.csv("~/Desktop/stat-331-71-PA6/life_expectancy_years.csv")library(tidyverse)library(ggplot2)library(broom)library(knitr)library(kableExtra)library(lmtest)library(AER)gdp_clean <- gdp |>pivot_longer(cols ="X1959":"X2019",names_to ="Year",values_to ="GDP")life_clean <- life |>pivot_longer(cols ="X1959":"X2019",names_to ="Year",values_to ="Life_Expectancy") |>select("country", "Year", "Life_Expectancy")joined_data =merge(gdp_clean, life_clean, by =c("Year", "country")) |>mutate(Year =as.numeric(str_remove(Year, "X")),thousand =if_else(str_detect(GDP, "k"), T, F),GDP =as.double(str_remove(GDP, "k")),GDP =if_else(thousand, GDP *1000, GDP),lGDP =log(GDP) ) |>select(-thousand) |>drop_na()```# Summary of Data## Data DescriptionIn this final project report, we decided to explore the relationship between the inflation-adjusted GDP per capita in US dollars and the Life Expectancy at birth. We retrieved this data from Gapminder, which is a data source that combines data from multiple sources into a coherent time series. The life expectancy data set shows the average number of years a newborn child would live if current mortality rates stay the same. The data set includes data on almost every country and includes rates from 1799 to 2099. The inflation-adjusted GDP per capita in US dollars covers almost every country from 1959 to 2019, but there is some missing data in the earlier years for less developed countries. We joined these data sets together to analyze the relationship between GDP per capita and Life Expectancy between 1959 and 2019.This project often utilizes the $R^2$ value as a comparative measure for goodness of fit for a given model. This statistical measure determines the variation of a dependent variable that can be explained by an independent variable. A higher $R^2$ signifies a stronger relationship between the two variables. The value of the statistic ranges between 0 and 1, with 1 signifying a perfect relationship and 0 signifying no correlation at all.## Linear VisualizationTo visualize the relationship between these variables, we started with a scatterplot.```{r}ggplot(data = joined_data, aes(x = GDP, y = Life_Expectancy)) +geom_point(alpha =0.1, color ="orange") +labs(x ="GDP per capita",y ="",subtitle ="Life Expectancy",title ="Relationship between GDP per capita and Life Expectancy" ) +geom_smooth(method ="lm")```The relationship between $GDP$ and $LifeExpectancy$ is approximately logarithmic, so we did a log transformation of $GDP$. We called the new variable $lGDP$.$$lGDP = ln(GDP)$$## Logarithmic Visualization```{r}lineargraph =ggplot(data = joined_data, aes(x = lGDP, y = Life_Expectancy)) +geom_point(alpha =0.1, color ="orange") +labs(x ="Ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Relationship between Ln(GDP per capita) and Life Expectancy" ) +geom_smooth(method ="lm")lineargraphlogmodel =lm(Life_Expectancy ~ lGDP, data = joined_data)```We will fit a regression line to this transformed data. From here on out, we will examine only the relationship between $lGDP$ and $Life Expectancy$.# Linear Regression of Transformed Data## Assumptions of the Linear Regression ModelTo fit a linear regression we need to check 5 assumptions. Some of these assumptions are violated to differing degrees, but none of them are severely violated. The high sample size makes it so that very small violations of the assumptions can be detected.### 1. LinearityAs shown above, the data still shows some curvature after the log transformation.### 2. Random Sampling/ No AutocorrelationThe assumption of random sampling isn't met, but we can substitute the assumption of no autocorrelation. The data shows partial autocorrelation of about 0.2 which is relatively stable across various lags. The Durbin Watson test gives a p-value of 0 for the autocorrelation, meaning there is certainly autocorrelation in the data, so this assumption isn't met either. The blue dotted lines represent a 95% significance level.```{r}joined_data <- joined_data |>mutate(residuals = logmodel$residuals)acf(joined_data$residuals, type ="correlation")``````{r}#| include: falsedwtest(logmodel)```### 3. No colinearitySince there is only one explanatory variable ($lGDP$), there can't be any colinearity.### 4. HomoskedasticityHomoskedasticity means that the residuals at each point on the regression line have equal variance. In other words, the fitted values cannot predict the residuals. We conducted the Breusch-Pagan test to determine if Heteroskedasticity (unequal variances of residuals across the regression line) is present in the data.```{r}bptest(logmodel)```$$BP = n \times R^2_{\hat u^2} = 301.4 $$$$9162 \times R^2_{\hat u^2} = 301.4$$$$R^2_{\hat u^2} = 0.033$$While the data isn't homoskedastic, the Breusch-Pagan test statistic shows that only 3.3% of the variance in $u_i$ can be explained by the variance in $lGDP$. We can say this because $var(u_i) = E(u^2_i) - E(u_i)^2$ and $E(u_i)=0$. While this heteroskedasticity is very much present with a p-value of 0, that is more a function of the huge sample size than a significant amount of the variation in $u^2$ being explained by $lGDP$.### 5. Normality of $u$ (Residuals)```{r}ggplot(data = joined_data, aes(x = residuals)) +geom_histogram(fill ="steelblue") +labs(y ="",subtitle ="Count",x ="Residuals",title ="Histogram of Residuals" )```While the distribution is skewed left, it isn't severe. This is because of the low outliers mentioned earlier. Any negative effects of this assumption not being met should be made up for by the high sample size.## Model FitLinear regression models can be assessed by how much of the variance in the response they account for. We can calculate this with the $R^2$ statistic mentioned earlier.```{r}variance_table =matrix(nrow =1, ncol =3)colnames(variance_table) =c("Variance of Response","Variance of Fitted Values","Variance of Residuals")variance_table[1, 1] =var(joined_data$Life_Expectancy)variance_table[1, 2] =var(logmodel$fitted.values)variance_table[1, 3] =var(logmodel$residuals)variance_table[1,] =round(variance_table[1,], digits =2)kable(variance_table) |>kable_material_dark()```$$R^2 = 1-\frac{38.29}{97.63} = \frac{59.34}{97.63} = 0.608$$The variance in $lGDP$ accounts for 60.8% of the variance in $Life Expectancy$. This is a moderate $R^2$, so this model does a decent job of explaining the response. Considering how many factors may influence $Life Expectancy$ across countries and time, this is a good quality model.## Estimated ModelThe following model was obtained by minimzing the squared residuals of a linear model fitted to the data. This process is called OLS (Ordinary Least Squares.)$$\widehat{Life Expectancy} = 22.84 + 5.312lGDP$$The logarithmic transformation of GDP per capita makes the linear model fit quite well. This model predicts that if GDP per capita increases by 1%, mean $LifeExpectancy$ increases by approximately 0.05312 years. At a GDP per capita of 1 dollar, mean $LifeExpectancy$ is 22.84 years.## Changes Over TimeThe regression model we estimated combines data for the years 1959 through 2019. Since so much has changed over the course of those 60 years, we estimated a separate OLS model for each year to look at the changes over time.```{r}years =1959:2019lms_year =matrix(nrow =length(years), ncol =2)lms_year[, 1] = yearscoef_year =matrix(nrow =length(years), ncol =3)coef_year[, 1] = yearsfor (y in years) { year_data = joined_data |>filter(Year == y) year_model =lm(Life_Expectancy ~ lGDP, data = year_data) lms_year[y -1958, 2] =summary(year_model)$r.squared coef_year[y -1958, 2] = year_model$coefficients[1] coef_year[y -1958, 3] = year_model$coefficients[2]}lms_year =data.frame(lms_year)colnames(lms_year) =c("Year", "Rsquared")coef_year =data.frame(coef_year)colnames(coef_year) =c("Year", "Beta0", "Beta1")ggplot(data = lms_year, aes(x = Year, y = Rsquared)) +geom_line(color ="steelblue") +ylim(0, 1) +labs(y ="",subtitle ="R-Squared",title ="R-Squared of ln(GDP per capita) and Life Expectancy over time")```While variation in $lGDP$ has consistently explained about 60% of variation in $LifeExpectancy$, the nature of this relationship has changed significantly over time. There have also been dips in the $R^2$ corresponding to outliers in the data.```{r}library(patchwork)beta0 =ggplot(data = coef_year, aes(x = Year, y=Beta0)) +geom_line(color ="steelblue") +labs(y ="",subtitle ="Estimated Intercept",title ="Estimated Intercept Over Time")beta1 =ggplot(data = coef_year, aes(x = Year, y=Beta1)) +geom_line(color ="steelblue") +labs(y ="",subtitle ="Estimated Slope",title ="Estimated Slope of ln(GDP per capita) Over Time")beta0+beta1```The left graph shows how the mean $LifeExpectancy$ when GDP per capita is 1 dollar has increased dramatically over time. The right graph shows that while a 1% increase in GDP per capita increased $LifeExpectancy$ by almost 0.07 years in 1959, it now does so by only around 0.04 years.## Differences Between CountriesThere are no countries in the data that have suddenly increased their life expectancy without increasing their GDP, resulting in a high positive residual. There are, however, several low outliers in the data that result from countries going through traumatic events that decrease life expectancy. These low outliers can be seen on the interactive scatterplot below.```{r}library(plotly)plot_ly( joined_data,x =~ lGDP,y =~ Life_Expectancy,text =~paste("Country: ", country, '<br>Year:', Year))```### Low Outliers```{r}countries = joined_data |>select(country) |>distinct()lms_country =matrix(nrow =length(countries), ncol =2)lms_country[, 1] = countrieslms_country =data.frame(lms_country)colnames(lms_country) =c("Country", "Rsquared")idnum =1for (c in countries$country) { country_data = joined_data |>filter(country == c) country_model =lm(Life_Expectancy ~ lGDP, data = country_data) lms_country[idnum , "Rsquared"] =summary(country_model)$r.squared idnum = idnum +1}ggplot(data = lms_country, aes(x = Rsquared)) +geom_histogram(binwidth =0.05, fill ="steelblue") +labs(x ="R-squared",y ="",subtitle ="Count",title ="Histogram of Countries by R-Squared of ln(GDP per capita) and Life Expectancy" )```While the relationship between $lGDP$ and $Life Expectancy$ is quite strong for most countries, for some countries it is much weaker. This is mostly due to events that decrease life expectancy such as epidemics, famines, natural disasters and wars. Botswana for example had a linear relationship between $lGDP$ and $Life Expectancy$ until the AIDS crisis hit. This drop in $Life Expectancy$ while $lGDP$ continued to grow is responsible for its $R^2$ of 0.0013.```{r}outliers <- joined_data |>filter(abs(residuals) >22) |>mutate(lGDP =round(lGDP, digits =2),residuals =round(residuals, digits =2))kable( outliers,col.names =c("Year","Country","GDP per Capita","Life Expectancy","ln(GDP per Capita)","Residual" )) |>kable_material_dark("hover")```These low residuals are associated with several significant events, including [the Cameroonian War of Independence and resulting Civil War](https://en.wikipedia.org/wiki/Bamileke_War), [the Great Leap Forward and subsequent famine](https://en.wikipedia.org/wiki/Great_Leap_Forward), [the 1993 Rwandan Genocide](https://en.wikipedia.org/wiki/Rwandan_genocide) and [the AIDS crisis in Southern Africa](https://www.unicef.org/botswana/hiv) (particularly in Botswana.) We are unsure about the cause of the dips in life expectancy in Haiti in 2009 and Burundi in 1972. It would make more sense if life expectancy dipped in 2010 in Haiti [due to a 7.0 magnitude earthquake](https://en.wikipedia.org/wiki/2010_Haiti_earthquake) and Burundi in 1973 [due to a genocide of Hutus.](https://en.wikipedia.org/wiki/Ikiza) There may be a quirk in the data collection process. According to the authors of the dataset, some of the data points are ["rough guesstimates."](https://www.gapminder.org/data/documentation/gd004/)### Countries with a Weak Relationship```{r}joined_data |>filter(country =="Botswana") |>ggplot(aes(x = lGDP, y = Life_Expectancy)) +geom_point() +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Life Expectancy and ln(GDP per capita) in Botswana" ) +geom_smooth(method ="lm")```Botswana had an approximately linear relationship between $lGDP$ and $LifeExpectancy$ until the AIDS crisis hit.```{r}lms_country |>mutate(Rsquared =round(Rsquared, digits =3)) |>slice_min(order_by = Rsquared, n =8) |>kable() |>kable_material_dark("hover")```Countries that have suffered from natural disasters and prolonged civil wars make up most of the countries with weak relationships.### Countries with a Strong Relationship```{r}joined_data |>filter(country =="Germany") |>ggplot(aes(x = lGDP, y = Life_Expectancy)) +geom_point() +labs(x ="ln(GDP per capita)",y ="",subtitle ="Life Expectancy",title ="Life Expectancy and ln(GDP per capita) in Germany" ) +geom_smooth(method ="lm")```In countries without life expectancy-decreasing events since 1959, there can be a very strong linear relationship. In the case Germany, the $R^2$ is 0.983.```{r}lms_country |>mutate(Rsquared =round(Rsquared, digits =3)) |>slice_max(order_by = Rsquared, n =8) |>kable() |>kable_material_dark("hover")```## SimuluationWe are going to do two simulations to validate our model.### Simulated DistributionFirst, we generated a simulated distribution by taking each of the fitted values and adding noise based on the standard error of the residuals.```{r}set.seed(1738)sigma =sigma(logmodel)predictions =predict(logmodel)simulation <-function(x, rse) { x <- x +rnorm(n =length(x), mean =0, sd = rse)return(x)}actualhist =ggplot(data = joined_data, aes(x = Life_Expectancy)) +geom_histogram(binwidth =2, fill ="steelblue") +labs(x ="Life Expectancy",y ="",subtitle ="Count",title ="Observed Distribution" )simhist =ggplot(data = joined_data, aes(x =simulation(predictions, sigma))) +geom_histogram(binwidth =2, fill ="steelblue") +labs(x ="Simulated Life Expectancy",y ="",subtitle ="Count",title ="Simulated Distribution" )actualhist + simhist```Because of the left skew of the data, the simulated distribution doesn't exactly match the actual distribution. Due to the high sample size (9162 observations,) this isn't a huge issue as far as fitting the model.### Multiple Predictive ChecksSecond, we generated 2000 simulated data sets using the method above and found the $R^2$ of each one. This shouldn't be compared to the $R^2$ of 0.608 found earlier.```{r}simulate <-function() { x = predictions +rnorm(n =length(joined_data$lGDP),mean =0,sd =6.188) simmodel =lm(joined_data$Life_Expectancy ~ x)return(summary(simmodel)$r.squared)}ggplot(data =NULL, aes(x =map_dbl(1:2000, ~simulate()))) +geom_histogram(fill ="steelblue") +labs(x ="Simulated R-squared of Life Expectancy and Simulated ln(GDP per capita)",y ="",subtitle ="Count",title ="R-squared Simulation" )```# ConclusionAfter exploring the data and fitting a model, we can conclude that there is a moderate positive relationship between ln(GDP per capita) and life expectancy. This relationship has held over time, although it is sometimes thrown off by traumatic events.